Data Protection

Replication :- In Hadoop, clients perform read and write operations directly against HDFS. A client writes data to the slave node(s) (DataNodes) based on information provided by the master (NameNode), and it writes only one copy of a particular block to a DataNode. HDFS provides a reliable way to store huge amounts of data in a distributed environment by splitting files into blocks. The blocks are also replicated to provide fault tolerance. The default replication factor is 3, which is configurable. As you can see in the figure below, each block is replicated three times and stored on different DataNodes (assuming the default replication factor): 
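The replication factor is controlled by the `dfs.replication` property in `hdfs-site.xml`. A minimal fragment setting the default of 3 looks like this:

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Individual files can still override this default when they are created, so the property only sets the cluster-wide default.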

Once a block is completely written, the DataNode starts copying it to another DataNode. This process continues until the desired number of replicas exists on different DataNodes. Duplicate copies are never created on the same DataNode; instead, a DataNode creates replica blocks on other nodes. The master node alone decides which DataNodes a particular block should be replicated to. Master and slave nodes communicate through block reports. In the same way, a client can read data directly from the slave node(s) based on information provided by the master (NameNode).
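The write pipeline described above can be sketched as follows. This is a simplified illustration only (the node names and helper functions are made up, not the real HDFS implementation): the NameNode picks distinct target DataNodes, and the block is forwarded along the pipeline until the replication factor is reached.

```python
# Simplified sketch of HDFS pipeline replication (illustration only).
REPLICATION_FACTOR = 3

def choose_targets(datanodes, replication_factor):
    """NameNode-side: pick distinct DataNodes, never the same node twice."""
    if replication_factor > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return datanodes[:replication_factor]

def write_block(block, targets):
    """Client writes to the first DataNode; each DataNode in the pipeline
    forwards the block to the next one until all targets hold a replica."""
    stored_on = []
    for node in targets:          # pipeline: node i forwards to node i+1
        stored_on.append(node)    # the block now has a replica on this node
    return stored_on

datanodes = ["dn1", "dn2", "dn3", "dn4"]
replicas = write_block("blk_001", choose_targets(datanodes, REPLICATION_FACTOR))
# Every replica lands on a distinct DataNode, matching the rule above.
assert len(replicas) == len(set(replicas)) == REPLICATION_FACTOR
```

The key property the sketch demonstrates is the one stated in the text: no DataNode ever holds two copies of the same block.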

Rack Awareness in Hadoop
Rack Awareness in Hadoop is the concept of choosing closer DataNodes based on rack information, in order to reduce network traffic while reading and writing HDFS files in large Hadoop clusters. By default, a Hadoop installation assumes that all nodes belong to the same rack. The NameNode chooses DataNodes that are on the same rack, or a nearby rack, to serve read/write requests. The HDFS NameNode obtains this rack information by maintaining the rack ID of each DataNode.
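"Closer" here can be made concrete. In HDFS's network topology, the distance between two nodes is 0 if they are the same node, 2 if they are different nodes on the same rack, and 4 if they are on different racks. A small sketch (with made-up rack IDs) of how a reader would pick the nearest replica:

```python
# Sketch of rack-aware replica selection for reads (illustration only).
# Distances follow HDFS's network topology: 0 same node, 2 same rack,
# 4 different rack.

def distance(a, b, rack_of):
    """Network distance between two nodes given a node -> rack mapping."""
    if a == b:
        return 0
    if rack_of[a] == rack_of[b]:
        return 2
    return 4

def closest_replica(reader, replica_nodes, rack_of):
    """Pick the replica with the smallest network distance to the reader."""
    return min(replica_nodes, key=lambda dn: distance(reader, dn, rack_of))

rack_of = {"client": "/rack1", "dn1": "/rack1", "dn2": "/rack2", "dn3": "/rack2"}
# dn1 shares the client's rack, so it wins over the off-rack replicas.
assert closest_replica("client", ["dn2", "dn1", "dn3"], rack_of) == "dn1"
```

This is why keeping one replica on (or near) the reader's rack reduces cross-rack traffic.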

Consider a replication factor of 3 for data blocks on HDFS: for every block of data, two copies are stored on one rack, while the third copy is stored on a different rack. This rule is called the Replica Placement Policy.

Why Rack Awareness?
The main purposes of rack awareness are to:
  •  Improve data reliability and availability.
  •  Improve cluster performance.
  •  Prevent data loss if an entire rack fails.
  •  Improve network bandwidth utilization.
  •  Keep bulk data flow in-rack when possible.
Replica placement via Rack Awareness in Hadoop
The main purpose of replica placement via rack awareness is to improve data reliability and availability. A simple policy is to spread replicas across racks, which prevents losing data when an entire rack fails and allows the use of bandwidth from multiple racks when reading a file.

On multiple rack clusters, block replication follows the below policy:

No more than one replica is placed on any one node, and no more than two replicas are placed on the same rack. A consequence of this rule is that the number of racks used for a block's replicas is always less than the total number of replicas.

For example;

When the Hadoop framework creates a new block, it places the first replica on the local node, the second on a node in a different (remote) rack, and the third on a different node in that same remote rack. When re-replicating a block, if only one replica exists, the second is placed on a different rack; if two replicas exist and they are on the same rack, the third is placed on a different rack.
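The placement rules above can be sketched in a few lines. The topology, node names, and the `place_replicas` helper are all hypothetical; this is only an illustration of the rule "first replica local, second on a remote rack, third on a different node of that same remote rack":

```python
# Sketch of the default replica placement policy (illustration only).

def place_replicas(writer, nodes_by_rack):
    """nodes_by_rack: dict mapping rack ID -> list of node names."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer in ns)
    remote_rack = next(r for r in nodes_by_rack if r != local_rack)
    first = writer                                  # 1st: writer's own node
    second = nodes_by_rack[remote_rack][0]          # 2nd: a remote rack
    third = next(n for n in nodes_by_rack[remote_rack] if n != second)
    return [first, second, third]                   # 3rd: same remote rack

topology = {"/rack1": ["dn1", "dn2"], "/rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
# One replica per node, and no rack holds more than two replicas.
assert len(set(replicas)) == 3
```

Note how the result obeys both constraints from the previous section: one replica per node, at most two replicas per rack, and the block spans two racks — one fewer than its three replicas.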

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can undelete a file after deleting it as long as it remains in the /trash directory. To undelete a file, the user can navigate to the /trash directory and retrieve it. The /trash directory contains only the latest copy of each deleted file. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.
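The delete/undelete/expiry life cycle can be modeled with a toy in-memory namespace. This is not the real HDFS implementation — the paths and helper names are made up — but it captures the semantics described above: delete is a rename into /trash, undelete is a move back, and only the expiry pass actually frees the data.

```python
# Toy model of HDFS trash semantics (illustration only).
TRASH_INTERVAL = 6 * 60 * 60  # default retention: 6 hours, in seconds

namespace = {"/user/data/report.txt": "contents"}
trash = {}  # trash path -> (contents, deletion_time)

def delete(path, now):
    """Deleting only renames the file into /trash; blocks stay allocated."""
    trash["/trash" + path] = (namespace.pop(path), now)

def undelete(trash_path, dest, now):
    """Restoring moves the file out of /trash back into the namespace."""
    contents, _ = trash.pop(trash_path)
    namespace[dest] = contents

def expire(now):
    """Permanently remove trash entries older than the retention interval."""
    for p in [p for p, (_, t) in trash.items() if now - t > TRASH_INTERVAL]:
        del trash[p]  # only here are the file's blocks actually freed

delete("/user/data/report.txt", now=0)
undelete("/trash/user/data/report.txt", "/user/data/report.txt", now=100)
assert "/user/data/report.txt" in namespace  # restored intact
```

This also illustrates the time-delay remark above: free space appears only when the expiry pass runs, not at the moment of deletion.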

Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster.
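From the command line, the replication factor of an existing file can be lowered with `hdfs dfs -setrep`, e.g. `hdfs dfs -setrep 2 /path/to/file`. How the NameNode might choose which replica to drop can be sketched as follows — this is only an illustration of one sensible heuristic (keep as much rack diversity as possible), not the NameNode's actual selection logic, which also considers factors such as free disk space:

```python
# Sketch of excess-replica selection when the replication factor drops
# (illustration only; prefers deleting from the most crowded rack so the
# block keeps its rack diversity).

def pick_excess(replicas, rack_of, new_factor):
    """Return the replicas to delete so that len(kept) == new_factor."""
    to_delete = []
    kept = list(replicas)
    while len(kept) > new_factor:
        racks = {}
        for dn in kept:
            racks.setdefault(rack_of[dn], []).append(dn)
        crowded = max(racks.values(), key=len)  # rack with most replicas
        victim = crowded[0]
        kept.remove(victim)
        to_delete.append(victim)
    return to_delete

rack_of = {"dn1": "/rack1", "dn2": "/rack2", "dn3": "/rack2"}
excess = pick_excess(["dn1", "dn2", "dn3"], rack_of, new_factor=2)
# One replica is dropped from /rack2, so the block still spans both racks.
assert len(excess) == 1 and rack_of[excess[0]] == "/rack2"
```

As the text notes, the DataNode only learns of the decision on its next heartbeat, which is why the free space does not appear immediately after the `setReplication` call.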
